-
Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. For programming tasks, most models are finetuned either on costly human-annotated instruction-response pairs or on pairs generated by large, proprietary LLMs, whose licenses may not permit such use. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that base models can benefit more from alignment with their own data distribution. We further validate each component’s effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance. Overall, SelfCodeAlign shows for the first time that a strong instruction-tuned code LLM can result from self-alignment rather than distillation.
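The four-stage pipeline described in the abstract — concept extraction, task generation, response sampling with tests, and sandbox validation — is straightforward to express in outline. Below is a minimal, hypothetical sketch of that loop; `base_model.generate`, `run_in_sandbox`, and the prompt strings are placeholder names for illustration, not the paper's actual API, and the real pipeline includes filtering steps not shown here.

```python
# Hypothetical sketch of a SelfCodeAlign-style self-alignment loop.
# The same base model performs every generation step; no stronger
# teacher model and no human annotations are involved.

def self_align(base_model, run_in_sandbox, seed_snippets, n_responses=4):
    dataset = []
    for snippet in seed_snippets:
        # 1. Extract coding concepts from a high-quality seed snippet.
        concepts = base_model.generate(f"List the coding concepts in:\n{snippet}")
        # 2. Turn those concepts into a new instruction (task).
        task = base_model.generate(f"Write a coding task that uses: {concepts}")
        # 3. Sample several candidate responses, each paired with test cases.
        for _ in range(n_responses):
            response = base_model.generate(f"Solve:\n{task}")
            tests = base_model.generate(f"Write test cases for:\n{task}")
            # 4. Keep only pairs whose tests pass in a sandbox.
            if run_in_sandbox(response, tests):
                dataset.append({"instruction": task, "response": response})
                break  # one validated response per task suffices here
    return dataset
```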
-
Abstract: Precise localizations of a small number of repeating fast radio bursts (FRBs) using very long baseline interferometry (VLBI) have enabled multiwavelength follow-up observations revealing diverse local environments. However, the 2%–3% of FRB sources that are observed to repeat may not be representative of the full population. Here we use the VLBI capabilities of the full CHIME Outrigger array for the first time to localize a nearby (40 Mpc), bright (kJy), and apparently one-off FRB source, FRB 20250316A, to its environment on 13 pc scales. We use optical and radio observations to place deep constraints on associated transient emission and the properties of its local environment. We place a 5σ upper limit of L_9.9 GHz < 2.1 × 10^25 erg s^−1 Hz^−1 on spatially coincident radio emission, a factor of 100 lower than any known compact persistent radio source associated with an FRB. Our Keck Cosmic Web Imager observations allow us to characterize the gas density, metallicity, nature of gas ionization, dust extinction, and star formation rate through emission-line fluxes. We leverage the exceptional brightness and proximity of this source to place deep constraints on the repetition of FRB 20250316A and find that it is inconsistent with all well-studied repeaters given the nondetection of bursts at lower spectral energies. We explore the implications of a measured offset of 190 ± 20 pc from the center of the nearest star formation region in the context of progenitor channels. FRB 20250316A marks the beginning of an era of routine localizations for one-off FRBs on scales of tens of milliarcseconds, enabling large-scale studies of their local environments.
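For intuition, the quoted luminosity limit maps directly to a flux-density limit at the quoted distance via the standard relation L_ν = 4πd²S_ν. A quick sanity check in Python; the ~11 µJy number is derived here under an isotropic-emission assumption, not quoted from the paper:

```python
import math

MPC_TO_CM = 3.086e24   # centimeters per megaparsec
JY_TO_CGS = 1e-23      # erg s^-1 cm^-2 Hz^-1 per jansky

def flux_limit_from_luminosity(l_nu_cgs, d_mpc):
    """Invert L_nu = 4*pi*d^2 * S_nu to get the implied flux-density limit in Jy."""
    d_cm = d_mpc * MPC_TO_CM
    s_cgs = l_nu_cgs / (4 * math.pi * d_cm**2)
    return s_cgs / JY_TO_CGS

# 5-sigma spectral-luminosity limit and distance quoted in the abstract:
s_jy = flux_limit_from_luminosity(2.1e25, 40)
print(f"Implied 9.9 GHz flux-density limit: {s_jy * 1e6:.0f} microJy")
# -> roughly 11 microJy (isotropic emission, no K-correction assumed)
```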
-
Abstract: We report the discovery of the repeating fast radio burst (FRB) source FRB 20240209A using the Canadian Hydrogen Intensity Mapping Experiment (CHIME)/FRB telescope. We detected 22 bursts from this repeater between 2024 February and July, 6 of which were also recorded at the Outrigger station k’niʔatn k’l⌣stk’masqt (KKO). The multiple very long baseline interferometry localizations using the 66 km long CHIME–KKO baseline, each with a different baseline vector orientation due to the repeater’s high decl. of ∼86°, enabled the combined localization region to be constrained to 1″ × 2″. We present deep Gemini optical observations that, combined with the FRB localization, enabled a robust association of FRB 20240209A with the outskirts of a luminous galaxy (P(O|x) = 0.99; L ≈ 5.3 × 10^10 L_⊙). FRB 20240209A has a projected physical offset of 40 ± 5 kpc from the center of its host galaxy, making it the FRB with the largest host-galaxy offset to date. When normalized by the host galaxy size, the offset of FRB 20240209A (5.1 R_eff) is comparable to that of FRB 20200120E (5.7 R_eff), the only FRB source known to originate in a globular cluster. We consider several explanations for the large offset, including a progenitor that was kicked from the host galaxy or in situ formation in a low-luminosity satellite galaxy of the putative host, but find the most plausible scenario to be a globular cluster origin. This, coupled with the quiescent, elliptical nature of the host as demonstrated in our companion Letter, provides strong evidence for a delayed formation channel for the progenitor of the FRB source.
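Host-normalized offsets make FRB positions comparable across galaxies of very different sizes. A small sketch of that arithmetic; the effective radius below is back-solved from the abstract's own numbers (40 kpc ≈ 5.1 R_eff), not an independently quoted value:

```python
def normalized_offset(offset_kpc: float, r_eff_kpc: float) -> float:
    """Projected offset expressed in units of the host's effective radius."""
    return offset_kpc / r_eff_kpc

# Back-solve the implied effective radius from the abstract: 40 kpc at 5.1 R_eff
r_eff = 40 / 5.1                          # ~7.8 kpc (derived, not quoted)
print(normalized_offset(40, r_eff))       # ~5.1 for FRB 20240209A
print(normalized_offset(40 + 5, r_eff))   # ~5.8: +1 sigma on the 40 +/- 5 kpc offset
```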
-
How predictable are life trajectories? We investigated this question with a scientific mass collaboration using the common task method; 160 teams built predictive models for six life outcomes using data from the Fragile Families and Child Wellbeing Study, a high-quality birth cohort study. Despite using a rich dataset and applying machine-learning methods optimized for prediction, the best predictions were not very accurate and were only slightly better than those from a simple benchmark model. Within each outcome, prediction error was strongly associated with the family being predicted and weakly associated with the technique used to generate the prediction. Overall, these results suggest practical limits to the predictability of life outcomes in some settings and illustrate the value of mass collaborations in the social sciences.
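The comparison at the heart of this study — flexible machine-learning models versus a simple few-variable benchmark, scored on held-out data — is easy to sketch. The snippet below uses synthetic stand-in data and scikit-learn, not the Fragile Families data or the teams' actual models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for a rich survey dataset: many features, noisy outcome.
X = rng.normal(size=(2000, 200))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=2000)  # outcome is mostly noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simple benchmark: a linear model on a handful of features,
# analogous to the study's few-variable baseline.
bench = LinearRegression().fit(X_tr[:, :4], y_tr)
ml = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print("benchmark holdout R^2:", r2_score(y_te, bench.predict(X_te[:, :4])))
print("ML model holdout R^2: ", r2_score(y_te, ml.predict(X_te)))
# When outcomes are dominated by noise, the gap between the flexible model
# and the simple benchmark stays small -- the qualitative pattern the
# mass collaboration observed.
```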